Advanced Analytics with R (UG 21-24)
I am Ayush.
I am a researcher working at the intersection of data, law, development and economics.
I teach Data Science using R at Gokhale Institute of Politics and Economics
I am an RStudio (Posit) certified tidyverse instructor.
I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.
Reach me
ayush.ap58@gmail.com
ayush.patel@gipe.ac.in
Helps us make and verify claims about a population using sample estimates.
Helps us make and verify causal claims.
refer: Statistics by David Freedman
Write a simulation that will implement the previous example.
refer: Introduction to Modern Statistics
get_diffs <- function(...){
  # code each decision as 1 = promoted, 0 = not promoted
  decisions <- ifelse(
    openintro::sex_discrimination$decision == "promoted",
    1, 0
  )
  # shuffle all 48 decisions, breaking any link with sex
  sample(decisions,
         size = 48,
         replace = FALSE) -> shuffled_decision
  # difference in promotion rates between the two shuffled groups of 24
  mean(shuffled_decision[1:24]) - mean(shuffled_decision[25:48])
}
purrr::map_dbl(c(1:100),
get_diffs)
[1] 0.03000000 -0.05166667 -0.05166667 0.11166667 0.11166667 -0.13333333
[7] 0.03000000 0.03000000 -0.09333333 0.03000000 -0.05166667 -0.05166667
[13] 0.19333333 -0.21500000 0.03000000 -0.09333333 -0.05166667 0.03000000
[19] -0.01166667 -0.05166667 -0.25666667 -0.05166667 -0.13333333 -0.21500000
[25] -0.05166667 -0.13333333 -0.13333333 0.11166667 0.11166667 -0.05166667
[31] -0.13333333 0.03000000 -0.09333333 0.07000000 0.03000000 -0.05166667
[37] 0.03000000 0.11166667 -0.13333333 -0.05166667 0.03000000 0.03000000
[43] -0.09333333 0.03000000 0.19333333 0.19333333 0.19333333 -0.05166667
[49] -0.09333333 0.03000000 -0.13333333 -0.29666667 0.19333333 0.03000000
[55] -0.05166667 0.03000000 0.03000000 -0.09333333 0.15166667 0.27500000
[61] -0.05166667 0.03000000 -0.01166667 0.03000000 -0.13333333 -0.21500000
[67] 0.11166667 0.15166667 -0.05166667 -0.21500000 -0.13333333 -0.13333333
[73] -0.05166667 -0.21500000 -0.01166667 0.19333333 -0.13333333 0.03000000
[79] -0.05166667 -0.05166667 -0.01166667 0.11166667 0.19333333 -0.05166667
[85] 0.15166667 -0.05166667 -0.01166667 -0.13333333 -0.05166667 -0.13333333
[91] -0.01166667 0.03000000 0.07000000 0.11166667 -0.21500000 0.07000000
[97] 0.11166667 0.27500000 0.03000000 0.27500000
openintro::sex_discrimination|>
infer::specify(decision ~ sex,
success = "promoted")|>
infer::hypothesise(null = "independence")|>
infer::generate(reps = 100,
type = "permute")|>
infer::calculate(stat = "diff in props",
order = c("male","female"))
Response: decision (factor)
Explanatory: sex (factor)
Null Hypothesis: independence
# A tibble: 100 × 2
replicate stat
<int> <dbl>
1 1 0.0417
2 2 -0.0417
3 3 -0.0417
4 4 -0.125
5 5 -0.0417
6 6 0.0417
7 7 0.0417
8 8 -0.0417
9 9 -0.0417
10 10 -0.0417
# ℹ 90 more rows
But which approach is more convenient? Let us run both on the same data and compare.
tibble(
base_diffs = map_dbl(c(1:1000),
get_diffs),
infer_diffs = openintro::sex_discrimination|>
infer::specify(decision ~ sex,
success = "promoted")|>
infer::hypothesise(null = "independence")|>
infer::generate(reps = 1000,
type = "permute")|>
infer::calculate(stat = "diff in props",
order = c("male","female"))|>
pull(stat)
) -> differences_in_props
differences_in_props
# A tibble: 1,000 × 2
base_diffs infer_diffs
<dbl> <dbl>
1 -0.0517 -0.0417
2 -0.0517 0.208
3 -0.0117 -0.125
4 -0.215 -0.0417
5 0.193 -0.125
6 -0.0933 0.0417
7 -0.257 0.0417
8 -0.133 -0.0417
9 0.0300 -0.125
10 0.112 -0.0417
# ℹ 990 more rows
openintro::sex_discrimination|>
infer::specify(decision ~ sex,
success = "promoted")|>
infer::hypothesise(null = "independence")|>
infer::generate(reps = 1000,
type = "permute")|>
infer::calculate(stat = "diff in props",
order = c("male","female"))|>
infer::get_p_value(obs_stat = 0.292,
direction = "right")
# A tibble: 1 × 1
p_value
<dbl>
1 0.003
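The `obs_stat` of 0.292 used above is the observed difference in promotion rates between the two groups. A quick base-R check recovers it from the study's counts (21 of 24 "male" files and 14 of 24 "female" files were promoted in `openintro::sex_discrimination`; the variable name `obs_stat` here is just illustrative):

```r
# observed difference in promotion proportions, male minus female
obs_stat <- 21 / 24 - 14 / 24
round(obs_stat, 3)  # the value passed to infer::get_p_value()
```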
Use both methods on the sex discrimination study and iterate several times. Show that the two methods generate similar p-values.
Discuss the coin flip example from ISLR
Note
If we test hundreds of null hypotheses and reject every one with a p-value below 0.01, how many Type I errors should we expect?
The probability of making at least one Type I error when testing \(m\) hypotheses:
| Decision | \(H_o\) is True | \(H_o\) is False | Total |
|---|---|---|---|
| Reject \(H_o\) | V | S | R |
| Do Not Reject \(H_o\) | U | W | m - R |
| Total | \(m_o\) | m-\(m_o\) | m |
\(FWER = Pr(V \geq 1)\)
\[FWER(\alpha) = 1 - Pr(V = 0)\]
\[FWER(\alpha) = 1 - \prod_{j=1}^m (1-\alpha) = 1 - (1-\alpha)^m\]
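To see how quickly the family-wise error rate escalates, the closed form above can be evaluated for a few values of \(m\) (a minimal base-R sketch; the helper name `fwer` is ours):

```r
# FWER(alpha) = 1 - (1 - alpha)^m for m independent tests at level alpha
fwer <- function(m, alpha = 0.05) 1 - (1 - alpha)^m

round(sapply(c(1, 10, 100, 500), fwer), 3)
```

Even at a modest \(m = 100\), a family-wise error is nearly certain without any correction.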
'data.frame': 50 obs. of 2000 variables:
$ Manager1 : num -3.34 3.76 12.97 -4.87 2.02 ...
$ Manager2 : num -4.17 12.53 -2.58 7.98 -5.37 ...
$ Manager3 : num 9.389 3.403 -0.825 -4.027 -4.855 ...
[list output truncated]
\[FWER = Pr(\text{making at least one Type I error})\]
Assume that \(A_j\) is the event that we make a Type I error on the \(j\)-th test.
\(FWER = Pr(\bigcup_{j=1}^m A_j)\)
\(FWER \leq \sum_{j=1}^m Pr(A_j)\)
\(FWER(\alpha/m) \leq \sum_{j=1}^m \frac{\alpha}{m} = m \cdot \frac{\alpha}{m} = \alpha\)
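The bound can be checked by simulation. Under a true null, p-values are Uniform(0, 1), so we can draw them directly instead of running actual tests (a sketch under that assumption; the replication counts are arbitrary):

```r
set.seed(42)
m <- 100
alpha <- 0.05

# one replication = m null p-values; a family-wise error occurs if
# any of them falls at or below the Bonferroni threshold alpha/m
family_error <- replicate(5000, any(runif(m) <= alpha / m))

mean(family_error)  # estimated FWER; stays at or below roughly alpha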
On the Fund data, run an independent t-test (\(\alpha = 0.05\)) for each fund manager to see whether average excess returns are nonzero. Then run the tests again with the Bonferroni correction and see how many remain significant.
Procedure:
## run a t-test for every fund manager at alpha = 0.05 (Fund comes from the ISLR2 package)
Fund|>
summarise(
across(
everything(),
.fns = ~ t.test(.x, mu = 0)$p.value
)
)|>
pivot_longer(
everything(),
names_to = "Manager",
values_to = "Independent_t.test_pval"
) -> all_test_pvalues
head(all_test_pvalues)
# A tibble: 6 × 2
Manager Independent_t.test_pval
<chr> <dbl>
1 Manager1 0.00620
2 Manager2 0.918
3 Manager3 0.0116
4 Manager4 0.601
5 Manager5 0.756
6 Manager6 0.000965
# mark where the null is rejected under the Bonferroni correction (m = 2000)
all_test_pvalues|>
mutate(
reject_null_for_bonferroni = Independent_t.test_pval <= (0.05/2000)
) -> all_test_pvalues
head(all_test_pvalues)
# A tibble: 6 × 3
Manager Independent_t.test_pval reject_null_for_bonferroni
<chr> <dbl> <lgl>
1 Manager1 0.00620 FALSE
2 Manager2 0.918 FALSE
3 Manager3 0.0116 FALSE
4 Manager4 0.601 FALSE
5 Manager5 0.756 FALSE
6 Manager6 0.000965 FALSE
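Base R reaches the same conclusion via `p.adjust()`: scaling each p-value up by \(m\) and comparing to \(\alpha\) is equivalent to the manual \(\alpha/m\) threshold. A sketch on the six p-values just printed (the vector `pvals` is ours, copied from the output above):

```r
pvals <- c(0.00620, 0.918, 0.0116, 0.601, 0.756, 0.000965)

# Bonferroni-adjusted p-values, using the full family size n = 2000;
# none survives at alpha = 0.05, matching the table above
p.adjust(pvals, method = "bonferroni", n = 2000) <= 0.05
```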
## Execute Holm's Step-Down procedure
all_test_pvalues |>
  arrange(Independent_t.test_pval) |> # step 1: sort p-values ascending
  mutate(
    index = c(1:2000),
    # step 2: each ordered p-value's threshold is alpha/(m + 1 - index)
    is_L_val = (0.05/(2000+1-index)),
    is_L = Independent_t.test_pval > is_L_val
  )|>
  # step 3: L is the first (smallest-index) p-value to exceed its threshold
  filter(is_L)|>
  slice(1)|>
  pull(Independent_t.test_pval) -> holms_L
## reject the null hypotheses whose p-values are below L
all_test_pvalues|>
mutate(
reject_null_for_holms = Independent_t.test_pval < holms_L
)|>
### compare results
summarise(
rejected_nulls_vanilla = sum(Independent_t.test_pval<=0.05),
rejected_nulls_bonferroni = sum(reject_null_for_bonferroni),
rejected_nulls_holms = sum(reject_null_for_holms)
)|>
gt::gt()

| rejected_nulls_vanilla | rejected_nulls_bonferroni | rejected_nulls_holms |
|---|---|---|
| 289 | 0 | 0 |
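The step-down logic above can be sanity-checked against base R's `p.adjust(..., method = "holm")` on a toy set of p-values (`holm_reject` is our illustrative helper, not part of the pipeline above):

```r
# Holm's step-down on a small family of m tests: find the first ordered
# p-value exceeding alpha/(m + 1 - j) and reject everything before it
holm_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  ord <- order(p)
  thresholds <- alpha / (m + 1 - seq_len(m))
  L <- which(p[ord] > thresholds)[1]  # first failure (NA if none fail)
  reject <- rep(FALSE, m)
  if (is.na(L)) reject[] <- TRUE else reject[ord[seq_len(L - 1)]] <- TRUE
  reject
}

p <- c(0.001, 0.012, 0.021, 0.04, 0.3)
holm_reject(p)
p.adjust(p, method = "holm") <= 0.05  # same decisions
```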
Preventing all false positives comes at the cost of the power of the test.
So, instead, we try to keep the ratio \(\frac{V}{R}\) small, where \(R = V + S\) is the total number of rejections (all positives). This ratio is the False Discovery Proportion (FDP).
The False Discovery Rate (FDR) is the expectation of the FDP, \(FDR = E\left(\frac{V}{R}\right)\). This is the quantity we try to control.
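In practice the FDR is usually controlled with the Benjamini-Hochberg adjustment, available in base R as `p.adjust(..., method = "BH")` (a sketch on made-up p-values):

```r
p <- c(0.0001, 0.004, 0.019, 0.095, 0.12, 0.3, 0.6, 0.9)

# BH-adjusted p-values; comparing them to q = 0.1 keeps the expected
# proportion of false discoveries among all rejections at about 10%
p.adjust(p, method = "BH") <= 0.1
```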